Import the main dependencies: pandas, seaborn, NumPy and matplotlib
In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
- Create a variable mydata by reading the CSV file with pd.read_csv('wine-quality-white-and-red.csv')
- Make a copy of "mydata" named "df"
In [2]:
mydata=pd.read_csv('wine-quality-white-and-red.csv')
df=mydata.copy()
print(df.head(5))
    type  fixed acidity  volatile acidity  citric acid  residual sugar  \
0  white            7.0              0.27         0.36            20.7
1  white            6.3              0.30         0.34             1.6
2  white            8.1              0.28         0.40             6.9
3  white            7.2              0.23         0.32             8.5
4  white            7.2              0.23         0.32             8.5
   chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  \
0      0.045                 45.0                 170.0   1.0010  3.00
1      0.049                 14.0                 132.0   0.9940  3.30
2      0.050                 30.0                  97.0   0.9951  3.26
3      0.058                 47.0                 186.0   0.9956  3.19
4      0.058                 47.0                 186.0   0.9956  3.19
   sulphates  alcohol  quality
0       0.45      8.8        6
1       0.49      9.5        6
2       0.44     10.1        6
3       0.40      9.9        6
4       0.40      9.9        6
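The copy step matters because a plain assignment only creates a second name for the same DataFrame. The sketch below illustrates the difference on a small stand-in frame (three rows and three columns taken from the output above); the names `alias`, `after_copy_edit` and `after_alias_edit` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Small stand-in for the wine dataset (values from the first rows shown above)
mydata = pd.DataFrame({
    "type": ["white", "white", "white"],
    "fixed acidity": [7.0, 6.3, 8.1],
    "quality": [6, 6, 6],
})

df = mydata.copy()   # independent copy of the data
alias = mydata       # hypothetical: just another name for the same object

# Editing the copy leaves the original untouched
df.loc[0, "fixed acidity"] = 0.0
after_copy_edit = mydata.loc[0, "fixed acidity"]

# Editing through the alias changes the original as well
alias.loc[1, "fixed acidity"] = 0.0
after_alias_edit = mydata.loc[1, "fixed acidity"]

print(after_copy_edit, after_alias_edit)
```

Working on `df` therefore keeps `mydata` available as an unmodified reference for later comparisons.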
- Display a particular column of the dataset
In [3]:
colonne_acidity=df["fixed acidity"]
print(colonne_acidity)
0 7.0
1 6.3
2 8.1
3 7.2
4 7.2
...
6492 6.2
6493 5.9
6494 6.3
6495 5.9
6496 6.0
Name: fixed acidity, Length: 6497, dtype: float64
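Selecting with a single column name returns a Series, as in the output above, while selecting with a list of names returns a DataFrame. A minimal sketch on a stand-in frame (values copied from the first rows of the dataset; the names `as_series` and `as_frame` are hypothetical):

```python
import pandas as pd

# Stand-in frame with a few columns of the wine dataset
df = pd.DataFrame({
    "type": ["white", "white", "white"],
    "fixed acidity": [7.0, 6.3, 8.1],
    "chlorides": [0.045, 0.049, 0.050],
})

as_series = df["fixed acidity"]               # one name -> Series
as_frame = df[["fixed acidity", "chlorides"]]  # list of names -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The list form is the one to use when several columns are needed at once.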
- Display a particular value (row 4, column 5 by position, i.e. the chlorides column)
In [4]:
valeur_4_5 = df.iloc[4,5]
print(valeur_4_5)
0.058
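`iloc` addresses cells by integer position, so `df.iloc[4, 5]` depends on the column order. A label-based lookup with `loc` is one possible alternative that stays readable if columns are reordered. The sketch below rebuilds the first five rows and six columns from the output above and checks that both lookups agree; `by_position` and `by_label` are hypothetical names.

```python
import pandas as pd

# First five rows and first six columns, copied from the head() output
df = pd.DataFrame({
    "type": ["white"] * 5,
    "fixed acidity": [7.0, 6.3, 8.1, 7.2, 7.2],
    "volatile acidity": [0.27, 0.30, 0.28, 0.23, 0.23],
    "citric acid": [0.36, 0.34, 0.40, 0.32, 0.32],
    "residual sugar": [20.7, 1.6, 6.9, 8.5, 8.5],
    "chlorides": [0.045, 0.049, 0.050, 0.058, 0.058],
})

by_position = df.iloc[4, 5]         # row position 4, column position 5
by_label = df.loc[4, "chlorides"]   # row label 4, column label "chlorides"

print(df.columns[5], by_position, by_label)
```

With the default integer index, row position and row label coincide; after filtering or sorting they may not, and `loc` and `iloc` can then return different rows.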
- Display only the "red" wines, then the "white" wines
In [5]:
vins_rouges=df[df["type"] == "red"]
print(vins_rouges)
vins_blancs=df[df["type"] == "white"]
print(vins_blancs)
type fixed acidity volatile acidity citric acid residual sugar \
4898 red 7.4 0.700 0.00 1.9
4899 red 7.8 0.880 0.00 2.6
4900 red 7.8 0.760 0.04 2.3
4901 red 11.2 0.280 0.56 1.9
4902 red 7.4 0.700 0.00 1.9
... ... ... ... ... ...
6492 red 6.2 0.600 0.08 2.0
6493 red 5.9 0.550 0.10 2.2
6494 red 6.3 0.510 0.13 2.3
6495 red 5.9 0.645 0.12 2.0
6496 red 6.0 0.310 0.47 3.6
chlorides free sulfur dioxide total sulfur dioxide density pH \
4898 0.076 11.0 34.0 0.99780 3.51
4899 0.098 25.0 67.0 0.99680 3.20
4900 0.092 15.0 54.0 0.99700 3.26
4901 0.075 17.0 60.0 0.99800 3.16
4902 0.076 11.0 34.0 0.99780 3.51
... ... ... ... ... ...
6492 0.090 32.0 44.0 0.99490 3.45
6493 0.062 39.0 51.0 0.99512 3.52
6494 0.076 29.0 40.0 0.99574 3.42
6495 0.075 32.0 44.0 0.99547 3.57
6496 0.067 18.0 42.0 0.99549 3.39
sulphates alcohol quality
4898 0.56 9.4 5
4899 0.68 9.8 5
4900 0.65 9.8 5
4901 0.58 9.8 6
4902 0.56 9.4 5
... ... ... ...
6492 0.58 10.5 5
6493 0.76 11.2 6
6494 0.75 11.0 6
6495 0.71 10.2 5
6496 0.66 11.0 6
[1599 rows x 13 columns]
type fixed acidity volatile acidity citric acid residual sugar \
0 white 7.0 0.27 0.36 20.7
1 white 6.3 0.30 0.34 1.6
2 white 8.1 0.28 0.40 6.9
3 white 7.2 0.23 0.32 8.5
4 white 7.2 0.23 0.32 8.5
... ... ... ... ... ...
4893 white 6.2 0.21 0.29 1.6
4894 white 6.6 0.32 0.36 8.0
4895 white 6.5 0.24 0.19 1.2
4896 white 5.5 0.29 0.30 1.1
4897 white 6.0 0.21 0.38 0.8
chlorides free sulfur dioxide total sulfur dioxide density pH \
0 0.045 45.0 170.0 1.00100 3.00
1 0.049 14.0 132.0 0.99400 3.30
2 0.050 30.0 97.0 0.99510 3.26
3 0.058 47.0 186.0 0.99560 3.19
4 0.058 47.0 186.0 0.99560 3.19
... ... ... ... ... ...
4893 0.039 24.0 92.0 0.99114 3.27
4894 0.047 57.0 168.0 0.99490 3.15
4895 0.041 30.0 111.0 0.99254 2.99
4896 0.022 20.0 110.0 0.98869 3.34
4897 0.020 22.0 98.0 0.98941 3.26
sulphates alcohol quality
0 0.45 8.8 6
1 0.49 9.5 6
2 0.44 10.1 6
3 0.40 9.9 6
4 0.40 9.9 6
... ... ... ...
4893 0.50 11.2 6
4894 0.46 9.6 5
4895 0.46 9.4 6
4896 0.38 12.8 7
4897 0.32 11.8 6
[4898 rows x 13 columns]
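The filtered frames keep the row labels of the original dataset, which is why the red wines above start at index 4898. The sketch below shows the same boolean-mask filtering on a five-row stand-in (alcohol and quality values copied from the outputs above), plus a combined condition and an index reset. The names `is_red`, `strong_reds` and `reds_renumbered`, and the 9.5 threshold, are hypothetical.

```python
import pandas as pd

# Stand-in frame: two white and three red wines
df = pd.DataFrame({
    "type": ["white", "white", "red", "red", "red"],
    "alcohol": [8.8, 9.5, 9.4, 9.8, 9.8],
    "quality": [6, 6, 5, 5, 5],
})

is_red = df["type"] == "red"     # boolean Series, one value per row
vins_rouges = df[is_red]
vins_blancs = df[~is_red]        # ~ negates the mask

# Conditions combine with & (and) or | (or); each one needs parentheses
strong_reds = df[is_red & (df["alcohol"] > 9.5)]

# Filtering keeps the original labels; reset_index renumbers from 0
reds_renumbered = vins_rouges.reset_index(drop=True)

print(df["type"].value_counts())
print(list(vins_rouges.index), list(reds_renumbered.index))
```

`value_counts()` gives a quick check that the two subsets add up to the full dataset.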
- List the categorical variables in the dataset
In [6]:
# Heuristic: treat text (object) and integer columns as categorical
categorical_columns = df.select_dtypes(include=["object", "int"]).columns.tolist()
print("Categorical columns:", categorical_columns)
Categorical columns: ['type', 'quality']
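Selecting by dtype is only a heuristic: an integer column is not necessarily categorical, and a categorical one may be stored as float. A complementary check, sketched below under the assumption that "few distinct values" is an acceptable criterion, counts the unique values per column. The threshold `max_levels` and the name `low_cardinality` are hypothetical, and the frame is a five-row stand-in.

```python
import pandas as pd

# Stand-in frame with one text, one float and one integer column
df = pd.DataFrame({
    "type": ["white", "white", "red", "red", "red"],
    "alcohol": [8.8, 9.5, 9.4, 9.8, 9.8],
    "quality": [6, 6, 5, 5, 5],
})

max_levels = 3  # hypothetical threshold, to be tuned on the real dataset

# Columns with at most max_levels distinct values
low_cardinality = [col for col in df.columns if df[col].nunique() <= max_levels]
print(low_cardinality)

# Optionally store a column with pandas' dedicated categorical dtype
df["type"] = df["type"].astype("category")
print(list(df["type"].cat.categories))
```

On the full dataset the threshold would need to be at least 7 for quality to qualify, since the describe() output shows scores ranging from 3 to 9.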
- Describe the dataset using the describe function
In [7]:
print(df.describe())
fixed acidity volatile acidity citric acid residual sugar \
count 6497.000000 6497.000000 6497.000000 6497.000000
mean 7.215307 0.339666 0.318633 5.443235
std 1.296434 0.164636 0.145318 4.757804
min 3.800000 0.080000 0.000000 0.600000
25% 6.400000 0.230000 0.250000 1.800000
50% 7.000000 0.290000 0.310000 3.000000
75% 7.700000 0.400000 0.390000 8.100000
max 15.900000 1.580000 1.660000 65.800000
chlorides free sulfur dioxide total sulfur dioxide density \
count 6497.000000 6497.000000 6497.000000 6497.000000
mean 0.056034 30.525319 115.744574 0.994697
std 0.035034 17.749400 56.521855 0.002999
min 0.009000 1.000000 6.000000 0.987110
25% 0.038000 17.000000 77.000000 0.992340
50% 0.047000 29.000000 118.000000 0.994890
75% 0.065000 41.000000 156.000000 0.996990
max 0.611000 289.000000 440.000000 1.038980
pH sulphates alcohol quality
count 6497.000000 6497.000000 6497.000000 6497.000000
mean 3.218501 0.531268 10.491801 5.818378
std 0.160787 0.148806 1.192712 0.873255
min 2.720000 0.220000 8.000000 3.000000
25% 3.110000 0.430000 9.500000 5.000000
50% 3.210000 0.510000 10.300000 6.000000
75% 3.320000 0.600000 11.300000 6.000000
max 4.010000 2.000000 14.900000 9.000000
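The table above contains no `type` column because `describe()` summarizes only numeric columns by default. The sketch below, on a five-row stand-in frame, shows one way to summarize the text column separately and to compute statistics per wine type. The names `numeric_summary`, `type_summary` and `by_type` are hypothetical.

```python
import pandas as pd

# Stand-in frame: two white and three red wines
df = pd.DataFrame({
    "type": ["white", "white", "red", "red", "red"],
    "alcohol": [8.8, 9.5, 9.4, 9.8, 9.8],
    "quality": [6, 6, 5, 5, 5],
})

numeric_summary = df.describe()       # numeric columns only
type_summary = df["type"].describe()  # count, unique, top, freq

# Per-group statistics for one numeric column
by_type = df.groupby("type")["alcohol"].agg(["count", "mean", "min", "max"])

print(list(numeric_summary.columns))
print(type_summary)
print(by_type)
```

Grouping by type is a useful complement here, since the pooled statistics mix 4898 white and 1599 red wines.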
- Analyze the dataset with the different pairplot options
In [8]:
sns.pairplot(df)
Out[8]:
<seaborn.axisgrid.PairGrid at 0x1fbaa29dc40>
In [9]:
sns.pairplot(df, hue='type')
Out[9]:
<seaborn.axisgrid.PairGrid at 0x1fbb2e5e2a0>
In [10]:
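# Note: custom_palette is not passed to pairplot below; it covers qualities 3 to 7 only,
# while quality ranges from 3 to 9 in the data
custom_palette = {3: 'blue', 4: 'green', 5: 'yellow', 6: 'red', 7: 'purple'}
sns.pairplot(df, hue='quality', height=10, palette="hls")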
Out[10]:
<seaborn.axisgrid.PairGrid at 0x1fbc1886d20>